White Wine Quality Analysis by Suliman Rashed

This dataset is public and available for research purposes only.
The link to the website is here: http://www3.dsi.uminho.pt/pcortez/wine/.
Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The dataset has 13 variables: an index column (X), 11 physicochemical measurements, and the quality variable, which holds the experts’ score on a scale from 0 to 10.
To use quality as a class label, we need a factor version of this variable.

df$quality.factor <- as.factor(df$quality)
str(df)
## 'data.frame':    4898 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.factor      : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...

Univariate Plots Section

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality      quality.factor
##  Min.   : 8.00   Min.   :3.000   3:  20        
##  1st Qu.: 9.50   1st Qu.:5.000   4: 163        
##  Median :10.40   Median :6.000   5:1457        
##  Mean   :10.51   Mean   :5.878   6:2198        
##  3rd Qu.:11.40   3rd Qu.:6.000   7: 880        
##  Max.   :14.20   Max.   :9.000   8: 175        
##                                  9:   5

To clarify these charts:
- Blue line: the median.
- Red line: the mean.
- Dashed black line: the quantiles.
- Dashed orange line: the 1st and 99th percentiles (0.01 and 0.99).
All of these measures are approximately normally distributed. Each has some outliers, but they do not change the overall shape of the distribution.
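The legend above can be reproduced with a sketch like the following. It is not the original plotting code: it assumes ggplot2, assumes the dashed black lines mark the 25% and 75% quantiles, and uses only the first ten alcohol values printed by str() above, so the picture will differ from the full-data charts. The names wines, ref, and p are made up for this illustration.

```r
library(ggplot2)

# First ten alcohol values, copied from the str() output above
# (illustration only; the real charts use all 4898 rows).
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
wines <- data.frame(alcohol = alcohol)

# Reference values matching the legend.
# Assumption: "quantiles" means the 25% and 75% quantiles.
ref <- list(
  median    = median(alcohol),
  mean      = mean(alcohol),
  quartiles = unname(quantile(alcohol, c(0.25, 0.75))),
  tails     = unname(quantile(alcohol, c(0.01, 0.99)))
)

# Histogram with one vertical line per reference value
p <- ggplot(wines, aes(x = alcohol)) +
  geom_histogram(bins = 10, fill = "grey80", colour = "white") +
  geom_vline(xintercept = ref$median, colour = "blue") +
  geom_vline(xintercept = ref$mean, colour = "red") +
  geom_vline(xintercept = ref$quartiles, colour = "black",
             linetype = "dashed") +
  geom_vline(xintercept = ref$tails, colour = "orange",
             linetype = "dashed")
print(p)
```

To apply the same idea to another column, swap the vector and the aes() mapping; the reference values are recomputed from whatever data is passed in.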

Since quality is a categorical variable, let’s group it into 3 categories (A is the best and C is the worst).

# Create a new variable (quality.catagories): "A" for scores 8-10, "B" for 5-7,
# "C" for 0-4
df$quality.catagories <- ifelse(df$quality >= 8, "A",
                                ifelse(df$quality >= 5, "B",
                                       ifelse(df$quality >= 0, "C",
                                                            "Other")))

After that, we will take one of these categories and plot it against the physicochemical variables to learn the characteristics of the best white wine.

# Bar chart of the number of wines in each quality category
ggplot(df, aes(x = quality.catagories)) +
  geom_bar()

Above is a bar chart of the counts in quality.catagories.

Most of the features for the best wines appear normally distributed, except residual.sugar and alcohol. There are also some high outliers in almost every one of these features.

The first plot (residual.sugar) has a long tail, with most of the data below 20. After applying a log10 scale in the second plot, the difference is clear.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 observations of 13 variables: 11 physicochemical variables, one categorical variable, and one index variable. The 11 physicochemical features are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol, and all of them are numeric. The remaining variable, quality, is stored as numbers but treated as categorical, since it holds the experts’ ratings.

What is/are the main feature(s) of interest in your dataset?

The main feature is the quality variable, since it represents the experts’ ratings.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Chlorides, density, and alcohol.

Did you create any new variables from existing variables in the dataset?

Yes: quality.factor and quality.catagories, both derived from quality.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

residual.sugar had a long tail, with most of the data below 20. So I applied a log10 transformation to its x-axis, which made the shape of the distribution much clearer.
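The effect of the log10 scale can be checked numerically from the summary() table above. The sketch below is only an illustration: it compares how far the maximum sits from the third quartile with how far the minimum sits from the first quartile, before and after taking log10. The helper plot_sugar_log10() and the variable names are hypothetical, ggplot2 is assumed, and the ten-row data frame holds the first residual.sugar values printed by str().

```r
library(ggplot2)

# Min, 1st Qu., Median, 3rd Qu., Max of residual.sugar,
# copied from the summary() output above.
five_num <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Ratio of upper tail length to lower tail length.
# A value near 1 means the two tails are about equally long.
tail_ratio <- function(s) {
  unname((s["max"] - s["q3"]) / (s["q1"] - s["min"]))
}

raw_ratio <- tail_ratio(five_num)         # very lopsided on the raw scale
log_ratio <- tail_ratio(log10(five_num))  # far more balanced after log10

# Hypothetical helper: histogram of residual.sugar on a log10 x-axis.
plot_sugar_log10 <- function(wines) {
  ggplot(wines, aes(x = residual.sugar)) +
    geom_histogram(bins = 30) +
    scale_x_log10()
}

# First ten residual.sugar values from the str() output (illustration only).
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)
p <- plot_sugar_log10(wines)
print(p)
```

On the raw scale the upper tail is roughly fifty times longer than the lower one; on the log10 scale it is less than twice as long, which is why the transformed histogram is easier to read.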

Bivariate Plots Section

The plot matrix above shows that the feature most strongly correlated with quality is alcohol (r = 0.436). The correlation between density and residual.sugar is strongly positive (r = 0.839), and density and total.sulfur.dioxide are also positively correlated (r = 0.53). On the negative side, the correlation between density and alcohol is r = -0.78, and between density and quality it is r = -0.307.

The box plots show that two features stand out in their relationship with the quality ratings: alcohol, with a positive relationship, and density, with a negative one. The other features have less effect on quality than these two, so let’s focus on them.

From the density plot, we expect higher density to go with lower ratings. Conversely, we expect higher alcohol to go with higher quality ratings.

## r = -0.7801376

Let’s plot some of the other chemical features to see the relationships between them:

density and residual.sugar

## r = 0.8389665

density and total.sulfur.dioxide

## r = 0.5298813

alcohol and residual.sugar

## r = -0.4506312
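
The r values in this section are Pearson correlation coefficients. The report does not show how they were computed, so the following is only a sketch of one way to get such a value in base R, using cor(). It runs on the first ten rows printed by str(), so the number it produces will not match the full-data value reported above.

```r
# First ten values of two columns, copied from the str() output above
# (illustration only; the reported r uses all 4898 rows).
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Pearson's r with the built-in function
r <- cor(alcohol, residual.sugar, method = "pearson")

# The same quantity from its definition:
# covariance divided by the product of the standard deviations
r_manual <- cov(alcohol, residual.sugar) /
  (sd(alcohol) * sd(residual.sugar))

# Print in the same style as the output lines above
cat("r =", r, "\n")
```

With the full data frame, the equivalent call would be cor(df$alcohol, df$residual.sugar). cor.test() takes the same two vectors and additionally reports a confidence interval and a p-value.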

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We saw above that the feature most strongly correlated with quality is alcohol (r = 0.436). We also saw that the correlation between density and residual.sugar is strongly positive (r = 0.839), and that density and total.sulfur.dioxide are positively correlated (r = 0.53). On the negative side, the correlation between density and alcohol is r = -0.78, and between density and quality it is r = -0.307.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, between alcohol and density, and also between residual.sugar and density.

What was the strongest relationship you found?

It was between residual.sugar and density with (r = 0.839).

Multivariate Plots Section

Given the strong negative relationship between alcohol and density, the plot above shows that the best wines tend to have more alcohol, and the worst tend to have higher density.

Keeping density in both plots, the left plot shows that better wines have more residual sugar, and the right plot shows that wines with more alcohol also have more residual sugar.
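
The first plot described in this section (alcohol against density, split by quality category) could be drawn with something like the sketch below. This is an assumption about the plot's layout, not the original code: plot_alcohol_density() is a hypothetical helper, ggplot2 is assumed, and the five rows are the first values printed by str(), which all fall into category B.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of alcohol vs. density,
# coloured by the quality category defined earlier.
plot_alcohol_density <- function(wines) {
  ggplot(wines, aes(x = alcohol, y = density,
                    colour = quality.catagories)) +
    geom_point(alpha = 0.5) +
    labs(x = "alcohol", y = "density", colour = "quality category")
}

# First five rows, copied from the str() output (illustration only).
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Same A/B/C rule as above: 8-10 is "A", 5-7 is "B", 0-4 is "C"
wines$quality.catagories <- ifelse(wines$quality >= 8, "A",
                                   ifelse(wines$quality >= 5, "B", "C"))

p <- plot_alcohol_density(wines)
print(p)
```

Passing the full data frame (with quality.catagories already added) to the helper would show all three categories; mapping colour to a third variable is what turns the bivariate scatter plot into a multivariate one.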

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The best wines tend to have more alcohol, and the worst tend to have higher density. We can also see that better wines have more residual sugar, and that wines with more alcohol have more residual sugar.

Were there any interesting or surprising interactions between features?

No, I didn’t see anything.


Final Plots and Summary

Plot One

Description One

Since the correlation between density and quality is negative (r = -0.3071233), the plot above shows that when density is high, we expect quality to be low.

Plot Two

Description Two

The plot shows that the correlation between residual.sugar and density is strongly positive (r = 0.839). So the higher the residual sugar, the higher we expect the density to be.

Plot Three

Description Three

Given the strong negative relationship between alcohol and density (r = -0.7801376), the plot above shows that the best wines tend to have more alcohol, and the worst tend to have higher density.


Reflection

It was an interesting project and an interesting dataset to investigate. The progression of the project was fascinating, moving from univariate to bivariate to multivariate analysis. It showed which chemical property has the strongest effect on a wine’s quality, how the properties relate to each other, and how those relationships affect quality in the end.
I think the analysis would be more accurate with more observations (wines) to test, since no wine in this dataset is rated 10 out of 10.
In the future, we could apply the same approach to other datasets, such as the red wines mentioned in this dataset’s description. We could also go deeper into this dataset and explore more relationships between the chemical properties.